Ryan Tibshirani
Statistics and Machine Learning
Carnegie Mellon University
May 19, 2021
I can’t cover all of this! I’ll focus on our API and give some basic data demos (reproducible: all code included) then reflect on a few lessons learned
Outline:
The COVIDcast API is based on HTTP GET queries and returns data in JSON or CSV format
| Parameter | Description | Examples |
|---|---|---|
| data_source | data source | doctor-visits or fb-survey |
| signal | signal derived from data source | smoothed_cli or smoothed_adj_cli |
| time_type | temporal resolution of the signal | day or week |
| geo_type | spatial resolution of the signal | county, hrr, msa, or state |
| time_values | time units over which events happened | 20200406 or 20200406-20200410 |
| geo_value | location codes, depending on geo_type | * for all, or pa for Pennsylvania |
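As a rough illustration of how these parameters combine into a single GET request, the sketch below assembles a query URL in base R. The base URL and the `source=covidcast` argument are assumptions made for illustration (check the API documentation for the current endpoint), and `build_covidcast_url` is a hypothetical helper, not part of the covidcast package.

```r
# Hypothetical helper: assemble a COVIDcast-style GET query from named parameters.
# ASSUMPTION: the base URL and the `source=covidcast` argument are illustrative;
# consult the API documentation for the real endpoint.
build_covidcast_url <- function(params,
                                base_url = "https://api.covidcast.cmu.edu/epidata/api.php") {
  stopifnot(is.list(params), !is.null(names(params)), all(names(params) != ""))
  # Percent-encode each value so reserved characters are safe in a query string
  encoded <- vapply(params,
                    function(v) utils::URLencode(as.character(v), reserved = TRUE),
                    character(1))
  query <- paste(names(params), encoded, sep = "=", collapse = "&")
  paste0(base_url, "?source=covidcast&", query)
}

# Parameter values taken from the examples column of the table
url <- build_covidcast_url(list(
  data_source = "fb-survey",
  signal      = "smoothed_cli",
  time_type   = "day",
  geo_type    = "state",
  time_values = "20200406-20200410",
  geo_value   = "pa"
))
print(url)
```

The helper only builds the string; fetching and parsing the JSON or CSV response is left to whichever HTTP client you prefer.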
We also provide R and Python packages for API access. Highlights:
(Have an idea? File an issue or contribute a PR on our public GitHub repo)
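# Packages used throughout the demos below
library(covidcast)
library(dplyr)        # %>%, group_by, summarize, mutate, slice, filter
library(purrr)        # map_dfr
library(ggplot2)      # ggplot, scale_x_date, theme
library(directlabels) # geom_dl
library(gridExtra)    # grid.arrange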
covidcast_meta() %>%
group_by(data_source, signal) %>%
summarize(county = ifelse("county" %in% geo_type, "*", ""),
msa = ifelse("msa" %in% geo_type, "*", ""),
hrr = ifelse("hrr" %in% geo_type, "*", ""),
state = ifelse("state" %in% geo_type, "*", "")) %>%
mutate(signal = ifelse(nchar(signal) <= 35, signal,
paste0(substr(signal, 1, 32), "..."))) %>%
slice(grep("(raw|7dav|\\_w)", signal, invert = TRUE)) %>%
as.data.frame() %>%
print(right = FALSE, row.names = FALSE)
## data_source signal county msa hrr state
## chng smoothed_adj_outpatient_cli * * * *
## chng smoothed_adj_outpatient_covid * * * *
## chng smoothed_outpatient_cli * * * *
## chng smoothed_outpatient_covid * * * *
## covid-act-now pcr_specimen_positivity_rate * * * *
## covid-act-now pcr_specimen_total_tests * * * *
## doctor-visits smoothed_adj_cli * * * *
## doctor-visits smoothed_cli * * * *
## fb-survey smoothed_accept_covid_vaccine * * * *
## fb-survey smoothed_anxious_5d * * * *
## fb-survey smoothed_anxious_7d * * * *
## fb-survey smoothed_cli * * * *
## fb-survey smoothed_covid_vaccinated * * * *
## fb-survey smoothed_covid_vaccinated_or_accept * * * *
## fb-survey smoothed_depressed_5d * * * *
## fb-survey smoothed_depressed_7d * * * *
## fb-survey smoothed_dontneed_reason_dont_sp... * * * *
## fb-survey smoothed_dontneed_reason_had_covid * * * *
## fb-survey smoothed_dontneed_reason_not_ben... * * * *
## fb-survey smoothed_dontneed_reason_not_hig... * * * *
## fb-survey smoothed_dontneed_reason_not_ser... * * * *
## fb-survey smoothed_dontneed_reason_other * * * *
## fb-survey smoothed_dontneed_reason_precaut... * * * *
## fb-survey smoothed_felt_isolated_5d * * * *
## fb-survey smoothed_felt_isolated_7d * * * *
## fb-survey smoothed_hesitancy_reason_allergic * * * *
## fb-survey smoothed_hesitancy_reason_cost * * * *
## fb-survey smoothed_hesitancy_reason_dislik... * * * *
## fb-survey smoothed_hesitancy_reason_distru... * * * *
## fb-survey smoothed_hesitancy_reason_distru... * * * *
## fb-survey smoothed_hesitancy_reason_health... * * * *
## fb-survey smoothed_hesitancy_reason_ineffe... * * * *
## fb-survey smoothed_hesitancy_reason_low_pr... * * * *
## fb-survey smoothed_hesitancy_reason_not_re... * * * *
## fb-survey smoothed_hesitancy_reason_other * * * *
## fb-survey smoothed_hesitancy_reason_pregnant * * * *
## fb-survey smoothed_hesitancy_reason_religious * * * *
## fb-survey smoothed_hesitancy_reason_sideef... * * * *
## fb-survey smoothed_hesitancy_reason_unnece... * * * *
## fb-survey smoothed_hh_cmnty_cli * * * *
## fb-survey smoothed_ili * * * *
## fb-survey smoothed_inperson_school_fulltime * * * *
## fb-survey smoothed_inperson_school_parttime * * * *
## fb-survey smoothed_large_event_1d * * * *
## fb-survey smoothed_large_event_indoors_1d * * * *
## fb-survey smoothed_nohh_cmnty_cli * * * *
## fb-survey smoothed_others_masked * * * *
## fb-survey smoothed_public_transit_1d * * * *
## fb-survey smoothed_received_2_vaccine_doses * * * *
## fb-survey smoothed_restaurant_1d * * * *
## fb-survey smoothed_restaurant_indoors_1d * * * *
## fb-survey smoothed_screening_tested_positi... * * * *
## fb-survey smoothed_shop_1d * * * *
## fb-survey smoothed_shop_indoors_1d * * * *
## fb-survey smoothed_spent_time_1d * * * *
## fb-survey smoothed_spent_time_indoors_1d * * * *
## fb-survey smoothed_tested_14d * * * *
## fb-survey smoothed_tested_positive_14d * * * *
## fb-survey smoothed_travel_outside_state_5d * * * *
## fb-survey smoothed_travel_outside_state_7d * * * *
## fb-survey smoothed_vaccine_likely_doctors * * * *
## fb-survey smoothed_vaccine_likely_friends * * * *
## fb-survey smoothed_vaccine_likely_govt_health * * * *
## fb-survey smoothed_vaccine_likely_local_he... * * * *
## fb-survey smoothed_vaccine_likely_politicians * * * *
## ght smoothed_search * * *
## google-survey smoothed_cli * * * *
## google-symptoms ageusia_smoothed_search * * * *
## google-symptoms anosmia_smoothed_search * * * *
## google-symptoms sum_anosmia_ageusia_smoothed_search * * * *
## hhs confirmed_admissions_1d *
## hhs confirmed_admissions_covid_1d *
## hhs sum_confirmed_suspected_admissio... *
## hhs sum_confirmed_suspected_admissio... *
## hospital-admissions smoothed_adj_covid19 * * * *
## hospital-admissions smoothed_adj_covid19_from_claims * * * *
## hospital-admissions smoothed_covid19 * * * *
## hospital-admissions smoothed_covid19_from_claims * * * *
## indicator-combination confirmed_cumulative_num * * * *
## indicator-combination confirmed_cumulative_prop * * * *
## indicator-combination confirmed_incidence_num * * * *
## indicator-combination confirmed_incidence_prop * * * *
## indicator-combination deaths_cumulative_num * * * *
## indicator-combination deaths_cumulative_prop * * * *
## indicator-combination deaths_incidence_num * * * *
## indicator-combination deaths_incidence_prop * * * *
## indicator-combination nmf_day_doc_fbc_fbs_ght * * *
## indicator-combination nmf_day_doc_fbs_ght * * *
## jhu-csse confirmed_cumulative_num * * * *
## jhu-csse confirmed_cumulative_prop * * * *
## jhu-csse confirmed_incidence_num * * * *
## jhu-csse confirmed_incidence_prop * * * *
## jhu-csse deaths_cumulative_num * * * *
## jhu-csse deaths_cumulative_prop * * * *
## jhu-csse deaths_incidence_num * * * *
## jhu-csse deaths_incidence_prop * * * *
## nchs-mortality deaths_allcause_incidence_num *
## nchs-mortality deaths_allcause_incidence_prop *
## nchs-mortality deaths_covid_and_pneumonia_notfl... *
## nchs-mortality deaths_covid_and_pneumonia_notfl... *
## nchs-mortality deaths_covid_incidence_num *
## nchs-mortality deaths_covid_incidence_prop *
## nchs-mortality deaths_flu_incidence_num *
## nchs-mortality deaths_flu_incidence_prop *
## nchs-mortality deaths_percent_of_expected *
## nchs-mortality deaths_pneumonia_notflu_incidenc... *
## nchs-mortality deaths_pneumonia_notflu_incidenc... *
## nchs-mortality deaths_pneumonia_or_flu_or_covid... *
## nchs-mortality deaths_pneumonia_or_flu_or_covid... *
## quidel covid_ag_smoothed_pct_positive * * * *
## quidel smoothed_pct_negative * *
## quidel smoothed_tests_per_device * *
## safegraph bars_visit_num * * * *
## safegraph bars_visit_prop * * * *
## safegraph completely_home_prop * * * *
## safegraph median_home_dwell_time * * * *
## safegraph restaurants_visit_num * * * *
## safegraph restaurants_visit_prop * * * *
## usa-facts confirmed_cumulative_num * * * *
## usa-facts confirmed_cumulative_prop * * * *
## usa-facts confirmed_incidence_num * * * *
## usa-facts confirmed_incidence_prop * * * *
## usa-facts deaths_cumulative_num * * * *
## usa-facts deaths_cumulative_prop * * * *
## usa-facts deaths_incidence_num * * * *
## usa-facts deaths_incidence_prop * * * *
## youtube-survey smoothed_cli *
## youtube-survey smoothed_ili *
How many COVID-19 deaths have been reported per day, in my state, since March 1?
start_day = "2020-03-01"
end_day = "2021-04-28"
deaths = covidcast_signal(data_source = "usa-facts",
signal = "deaths_7dav_incidence_num",
start_day = start_day, end_day = end_day,
geo_type = "state", geo_values = "pa")
plot(deaths, plot_type = "line",
title = "New COVID-19 deaths in PA (7-day average)") +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
theme(legend.position = "none")
What percentage of daily hospital admissions are due to COVID-19 in PA, NY, TX?
hosp = covidcast_signal(data_source = "hospital-admissions",
signal = "smoothed_adj_covid19_from_claims",
start_day = start_day, end_day = end_day,
geo_type = "state", geo_values = c("pa", "ny", "tx"))
plot(hosp, plot_type = "line",
title = "% of hospital admissions due to COVID-19") +
geom_dl(aes(y = value, color = geo_value, label = toupper(geo_value)),
method = "last.bumpup") +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
theme(legend.position = "none")
What does the current COVID-19 cumulative case rate look like, nationwide?
cases = covidcast_signal(data_source = "usa-facts",
signal = "confirmed_cumulative_prop",
start_day = end_day, end_day = end_day)
end_day_str = format.Date(end_day, "%B %d %Y")
plot(cases, title = "Cumulative COVID-19 cases per 100,000 people",
range = c(0, 12500),
choro_params = list(subtitle = end_day_str, legend_n = 6))
How do some cities compare in terms of doctor’s visits due to COVID-like illness?
dv = covidcast_signal(data_source = "doctor-visits",
signal = "smoothed_adj_cli",
start_day = start_day, end_day = end_day,
geo_type = "msa",
geo_values = name_to_cbsa(c("Miami", "New York",
"Pittsburgh", "San Antonio")))
plot(dv, plot_type = "line",
title = "% of doctor's visits due to COVID-like illness") +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
scale_color_hue(labels = cbsa_to_name(unique(dv$geo_value)))
How do my county and my friend’s county compare in terms of COVID symptoms?
sympt = covidcast_signal(data_source = "fb-survey",
signal = "smoothed_hh_cmnty_cli",
start_day = "2020-04-15", end_day = end_day,
geo_values = c(name_to_fips("Allegheny"),
name_to_fips("Fulton",
state = "GA")))
plot(sympt, plot_type = "line",
title = "% of people who know somebody with COVID symptoms") +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
scale_color_hue(labels = fips_to_name(unique(sympt$geo_value)))
How do some states compare in terms of self-reported mask usage?
states = c("dc", "ma", "ny", "wy", "sd", "id")
mask1 = covidcast_signal(data_source = "fb-survey",
signal = "smoothed_wwearing_mask",
start_day = "2020-09-15", end_day = "2021-02-10",
geo_type = "state", geo_values = states)
mask2 = covidcast_signal(data_source = "fb-survey",
signal = "smoothed_wwearing_mask_7d",
start_day = "2021-02-11", end_day = end_day,
geo_type = "state", geo_values = states)
mask = rbind(mask1, mask2)
plot(mask, plot_type = "line",
title = "% of people who wear masks in public most/all the time") +
geom_dl(aes(y = value, color = geo_value, label = toupper(geo_value)),
method = "last.bumpup") +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
theme(legend.position = "none")
How about vaccine uptake (self-reported), and willingness to take vaccine (if not yet vaccinated)?
states = c("dc", "ma", "ny", "wy", "sd", "id")
vaccine = covidcast_signals(data_source = "fb-survey",
signal = c("smoothed_wcovid_vaccinated",
"smoothed_waccept_covid_vaccine"),
start_day = "2021-01-15", end_day = end_day,
geo_type = "state", geo_values = states)
g1 = plot(vaccine[[1]], plot_type = "line",
title = "% of people who have received COVID-19 vaccine, self-reported") +
geom_dl(aes(y = value, color = geo_value, label = toupper(geo_value)),
method = "last.bumpup") +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
theme(legend.position = "none")
g2 = plot(vaccine[[2]], plot_type = "line",
title = "% of people who would accept COVID-19 vaccine, if haven't yet") +
geom_dl(aes(y = value, color = geo_value, label = toupper(geo_value)),
method = "last.bumpup") +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
theme(legend.position = "none")
grid.arrange(g1, g2, nrow = 1)
By default the API returns the most recent data for each time_value. We also provide access to all previous versions of the data, using the following optional parameters:
| Parameter | To get data … | Examples |
|---|---|---|
| as_of | as if we queried the API on a particular date | 20200406 |
| issues | published at a particular date or date range | 20200406 or 20200406-20200410 |
| lag | published a certain number of time units after events occurred | 1 or 3 |
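To make the difference between the three parameters concrete, here is a minimal base R sketch that applies each one to a small version history. The data frame is made up for illustration, and the `view_*` helpers are hypothetical stand-ins for what the API does on the server side, not functions from the covidcast package.

```r
# Made-up revision log: each row is one published version of one day's value.
versions <- data.frame(
  time_value = as.Date(c("2020-04-06", "2020-04-06", "2020-04-06",
                         "2020-04-07", "2020-04-07")),
  issue      = as.Date(c("2020-04-07", "2020-04-08", "2020-04-10",
                         "2020-04-08", "2020-04-10")),
  value      = c(1.0, 1.4, 1.5, 2.0, 2.3)
)

# as_of: for each time_value, keep the latest version issued on or before `as_of`
view_as_of <- function(df, as_of) {
  df <- df[df$issue <= as.Date(as_of), , drop = FALSE]
  df <- df[order(df$time_value, df$issue), , drop = FALSE]
  df[!duplicated(df$time_value, fromLast = TRUE), , drop = FALSE]
}

# issues: keep every version published within the range [from, to]
view_issues <- function(df, from, to = from) {
  df[df$issue >= as.Date(from) & df$issue <= as.Date(to), , drop = FALSE]
}

# lag: keep versions published exactly `lag` days after the event date
view_lag <- function(df, lag) {
  df[as.integer(df$issue - df$time_value) == lag, , drop = FALSE]
}

print(view_as_of(versions, "2020-04-08"))
print(view_issues(versions, "2020-04-08", "2020-04-10"))
print(view_lag(versions, 1))
```

Note that `as_of` returns one row per time_value, while `issues` and `lag` can return several versions (or none) of the same time_value.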
Why would we need this? Because many data sources are subject to revisions:
This presents a challenge to modelers: e.g., we have to learn how to forecast based on the data we’d have at the time, not updates that would arrive later. To accommodate, we log revisions even when the original data source does not!
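One way to log revisions when the upstream source only publishes its latest snapshot is to diff each new snapshot against what has already been recorded. The sketch below shows that idea in base R; it is an assumption about how such a log could be kept, not a description of our actual pipeline, and `log_revisions` and the toy snapshots are hypothetical.

```r
# Append to `log` only the rows of `snapshot` whose value is new or has changed
# relative to the most recently logged version of that time_value.
# ASSUMPTION: `log` is ordered by issue date, oldest first.
log_revisions <- function(log, snapshot, issue_date) {
  issue_date <- as.Date(issue_date)
  # Latest logged version of each time_value
  latest <- log[!duplicated(log$time_value, fromLast = TRUE), , drop = FALSE]
  prev <- latest$value[match(snapshot$time_value, latest$time_value)]
  changed <- is.na(prev) | prev != snapshot$value
  new_rows <- data.frame(time_value = snapshot$time_value[changed],
                         issue      = rep(issue_date, sum(changed)),
                         value      = snapshot$value[changed])
  rbind(log, new_rows)
}

# Start from an empty log and feed it three daily snapshots (made-up values)
revision_log <- data.frame(time_value = as.Date(character()),
                           issue      = as.Date(character()),
                           value      = numeric())
snap1 <- data.frame(time_value = as.Date("2020-04-06"), value = 1.0)
snap2 <- data.frame(time_value = as.Date(c("2020-04-06", "2020-04-07")),
                    value = c(1.4, 2.0))
snap3 <- snap2  # nothing changed upstream on the third day

revision_log <- log_revisions(revision_log, snap1, "2020-04-07")
revision_log <- log_revisions(revision_log, snap2, "2020-04-08")
revision_log <- log_revisions(revision_log, snap3, "2020-04-09")
print(revision_log)
```

Storing only the changed rows keeps the log compact while still allowing any past view of the data to be reconstructed.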
The last two weeks of August in CA …
# Let's get the data that was available as of 09/21, for the end of August in CA
dv = covidcast_signal(data_source = "doctor-visits",
signal = "smoothed_adj_cli",
start_day = "2020-08-15", end_day = "2020-08-31",
geo_type = "state", geo_values = "ca",
as_of = "2020-09-21")
# Plot the time series curve
xlim = c(as.Date("2020-08-15"), as.Date("2020-09-21"))
ylim = c(3.83, 5.92)
ggplot(dv, aes(x = time_value, y = value)) +
geom_line() +
coord_cartesian(xlim = xlim, ylim = ylim) +
geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
theme_bw() + theme(legend.position = "bottom")
The last two weeks of August in CA …
# Now loop over a bunch of "as of" dates, fetch data from the API for each one
as_ofs = seq(as.Date("2020-09-01"), as.Date("2020-09-21"), by = "week")
dv_as_of = map_dfr(as_ofs, function(as_of) {
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
start_day = "2020-08-15", end_day = "2020-08-31",
geo_type = "state", geo_values = "ca", as_of = as_of)
})
# Plot the time series curve "as of" September 1
dv_as_of %>%
filter(issue == as.Date("2020-09-01")) %>%
ggplot(aes(x = time_value, y = value)) +
geom_line(aes(color = factor(issue))) +
coord_cartesian(xlim = xlim, ylim = ylim) +
geom_vline(aes(color = factor(issue), xintercept = issue), lty = 2) +
labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
geom_line(data = dv, aes(x = time_value, y = value)) +
geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
theme_bw() + theme(legend.position = "none")
The last two weeks of August in CA …
dv_as_of %>%
ggplot(aes(x = time_value, y = value)) +
geom_line(aes(color = factor(issue))) +
coord_cartesian(xlim = xlim, ylim = ylim) +
geom_vline(aes(color = factor(issue), xintercept = issue), lty = 2) +
labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
geom_line(data = dv, aes(x = time_value, y = value)) +
geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
theme_bw() + theme(legend.position = "none")
Through a recruitment partnership with Facebook, we survey about 50,000 people daily in the US (over 20 million since the survey began in April 2020). Topics include:
This is the largest non-Census research survey ever conducted. Raw survey response data is available to researchers under a data use agreement. A parallel, international effort by the University of Maryland reaches 100+ countries in 55 languages
An attempt to distill some lessons learned from the past year, related to statistical modeling and machine learning, broken down by three areas:
The COVID-19 Forecast Hub collects short-term forecasts of incident COVID-19 cases, hospitalizations, and deaths. These are made by 50+ groups of “citizen scientists”, and power the CDC’s official communications on COVID-19 forecasting
This is not an easy problem:
(All of this—plus an additional model-level nonstationarity—carries over to building an ensemble model!)
Only a small handful of models consistently outperform the baseline (essentially the flat-line forecaster). For example, from Cramer et al. (2021):
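For readers unfamiliar with the baseline, a point-forecast version of a flat-line forecaster can be sketched in a few lines of R. This is a simplification under stated assumptions: the Forecast Hub baseline also produces predictive quantiles, which are omitted here, and the function names and numbers are made up for illustration.

```r
# Flat-line point forecast: carry the last observed value forward to every horizon
flatline_forecast <- function(y, horizons = 1:4) {
  stopifnot(length(y) >= 1)
  last_obs <- y[length(y)]
  setNames(rep(last_obs, length(horizons)), paste0("h", horizons))
}

# Mean absolute error of a set of point forecasts
mean_abs_error <- function(pred, actual) mean(abs(pred - actual))

observed <- c(10, 12, 15)   # made-up weekly counts
future   <- c(16, 18)       # made-up values observed later
fc <- flatline_forecast(observed, horizons = 1:2)
print(fc)
print(mean_abs_error(fc, future))
```

Any candidate model can be scored against this baseline by computing the same error on the same targets.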
Lessons/reflections:
Nowcasting: estimating the value of a signal that will only be fully observed at a later date. Current data is partial/noisy, but progressively improves as time passes
Example: suppose we want to use medical insurance claims to estimate how many people have some disease on some day (in some location)
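A very simple way to nowcast in the claims setting is to scale up the partial count by the share of claims that has historically arrived by that lag. The sketch below illustrates this in R; it is one basic approach chosen for illustration, not the method used in our systems, and the completeness values and function name are made up.

```r
# Made-up completeness curve: share of the eventual claim count that has
# typically arrived `lag` days after the day of service (illustrative only).
completeness <- c("1" = 0.40, "3" = 0.70, "7" = 0.90)

# Scale a partial count up to an estimate of the final count
nowcast_final_count <- function(partial_count, lag_days, completeness) {
  key <- as.character(lag_days)
  stopifnot(key %in% names(completeness))
  partial_count / completeness[[key]]
}

# 70 claims seen so far, 3 days after the day of interest
estimate <- nowcast_final_count(partial_count = 70, lag_days = 3, completeness)
print(estimate)
```

In practice the completeness curve itself must be estimated, and it can drift over time, which is part of what makes the problem hard.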
Meanwhile, in COVID-19, it’s even more complicated:
Even settling for the penultimate bullet, we would be nowcasting a latent variable (never observed)
Lessons/reflections:
While time scales may change, nowcasting is not going away as a central problem in public health …
The beginning of the pandemic created a clear pull for computational scientists: fetch case and death data from JHU CSSE’s GitHub, learn about SIR modeling, inject stochasticity, start making forecasts
We decided early on to swim against the stream. It’s not that this work wasn’t important, but rather, we felt we could create greater value by working on the data problem (to hopefully benefit many others)
We wouldn’t/couldn’t have taken this risk if there weren’t so many strong computational scientists who jumped into work on forecasting
It can be hard to quantify the value of good data. We will be trying to do this for years to come (not just us/our data … this is an important undertaking for the whole scientific community)
That said, we are starting to see (in retrospect) some encouraging results in problems where you can quantify value, like forecasting and nowcasting
For more, visit https://covidcast.cmu.edu (you’ll find everything linked from there)